Mobile Machine Learning Hardware at ARM: A Systems-on-Chip (SoC) Perspective
Machine learning is playing an increasingly significant role in emerging
mobile application domains such as AR/VR, ADAS, etc. Accordingly, hardware
architects have designed customized hardware for machine learning algorithms,
especially neural networks, to improve compute efficiency. However, machine
learning is typically just one processing stage in complex end-to-end
applications, involving multiple components of a mobile system-on-a-chip
(SoC). Focusing only on ML accelerators misses bigger optimization opportunities
at the system (SoC) level. This paper argues that hardware architects should
expand the optimization scope to the entire SoC. We demonstrate one particular
case study in the domain of continuous computer vision, where the camera sensor,
image signal processor (ISP), memory, and NN accelerator are synergistically
co-designed to achieve optimal system-level efficiency.
Measuring scheduling efficiency of RNNs for NLP applications
Recurrent neural networks (RNNs) have shown state of the art results for
speech recognition, natural language processing, image captioning and video
summarizing applications. Many of these applications run on low-power
platforms, so their energy efficiency is extremely important. We observed that
cache-oblivious RNN scheduling during inference typically results in 30-50x
more data transferred on and off the CPU than the application's working set
size, which can degrade the application's energy efficiency. This paper presents
a new metric, Data Reuse Efficiency (DRE), to gauge the RNN scheduling efficiency
of a platform, and shows the factors that influence the DRE value. Additionally,
this paper discusses an optimization to improve reuse in RNNs and highlights
the positive impact of this optimization on the total amount of memory read
from or written to the memory controller (and, hence, the DRE value) during the
execution of an RNN application on a mobile SoC.
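As a rough illustration of how such a metric behaves, the sketch below assumes DRE is defined as the application's working-set size divided by the total data moved through the memory controller during inference; the definition, the counter values, and the 40x traffic figure are assumptions chosen to match the range quoted above, not measurements from the paper.

```python
# Hypothetical sketch: estimate Data Reuse Efficiency (DRE) for one inference pass.
# Assumed definition: DRE = working_set_bytes / total memory-controller traffic,
# so DRE close to 1.0 means every byte is moved only once (perfect reuse) and
# DRE << 1.0 means the schedule re-fetches the working set many times.

def data_reuse_efficiency(working_set_bytes: int,
                          dram_read_bytes: int,
                          dram_write_bytes: int) -> float:
    """Fraction of the ideal (single-transfer) traffic actually achieved."""
    total_traffic = dram_read_bytes + dram_write_bytes
    return 1.0 if total_traffic == 0 else working_set_bytes / total_traffic

# Made-up numbers in the 30-50x range quoted above (not measured values):
working_set = 8 * 1024 * 1024                 # 8 MiB of weights + activations
traffic = 40 * working_set                    # schedule moves ~40x the working set
print(f"DRE = {data_reuse_efficiency(working_set, traffic, 0):.3f}")   # ~0.025
```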
Energy Efficient Hardware for On-Device CNN Inference via Transfer Learning
On-device CNN inference for real-time computer vision applications can result
in computational demands that far exceed the energy budgets of mobile devices.
This paper proposes FixyNN, a co-designed hardware accelerator platform which
splits a CNN model into two parts: a set of layers that are fixed in the
hardware platform as a front-end fixed-weight feature extractor, and the
remaining layers which become a back-end classifier running on a conventional
programmable CNN accelerator. The common front-end provides ubiquitous CNN
features for all FixyNN models, while the back-end is programmable and specific
to a given dataset. Image classification models for FixyNN are trained
end-to-end via transfer learning, with front-end layers fixed for the shared
feature extractor, and back-end layers fine-tuned for a specific task. Over a
suite of six datasets, we trained models via transfer learning with an accuracy
loss of <1%, resulting in a FixyNN hardware platform with nearly 2 times better
energy efficiency than a conventional programmable CNN accelerator of the same
silicon area (i.e. hardware cost). Comment: 4 pages, 2 figures, NeurIPS 2018 on-device ML workshop.
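A minimal PyTorch sketch of the transfer-learning recipe described above: early layers are frozen to play the role of the fixed-weight front-end, and only the remaining layers plus a new classifier head are fine-tuned per dataset. The choice of MobileNetV2, the split index, and the optimizer settings are illustrative assumptions, not the FixyNN configuration.

```python
import torch
import torch.nn as nn
from torchvision import models   # requires a recent torchvision for the weights enum

# Hypothetical FixyNN-style split: freeze the early feature layers as the shared,
# fixed-weight front-end and fine-tune only the remaining layers per task.
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)

SPLIT = 7                                  # assumed split point for the fixed front-end
for layer in model.features[:SPLIT]:
    for p in layer.parameters():
        p.requires_grad = False            # these weights would be hardened into silicon

num_classes = 10                           # e.g. one of the six transfer datasets
model.classifier[1] = nn.Linear(model.last_channel, num_classes)

# Only the back-end (unfrozen features plus the new classifier head) is trained.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)
```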
SCALE-Sim: Systolic CNN Accelerator Simulator
Systolic arrays are one of the most popular compute substrates within Deep
Learning accelerators today, as they provide extremely high efficiency for
running dense matrix multiplications. However, the research community lacks
tools that provide insight into both the design trade-offs and efficient mapping
strategies for systolic-array based accelerators. We introduce the Systolic CNN
Accelerator Simulator (SCALE-Sim), a configurable, cycle-accurate, systolic-array
based DNN accelerator simulator. SCALE-Sim exposes various
micro-architectural features as well as system integration parameters to the
designer to enable comprehensive design space exploration. To the best of our
knowledge, this is the first systolic-array simulator tuned for running DNNs.
Using SCALE-Sim, we conduct a suite of case studies and demonstrate the effect
of bandwidth, data flow and aspect ratio on the overall runtime and energy of
Deep Learning kernels across vision, speech, text, and games. We believe that
these insights will be highly beneficial to architects and ML practitioners.
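SCALE-Sim itself is driven by a hardware configuration and a layer-topology file; the snippet below is only a first-order analytical stand-in that captures why array aspect ratio matters, assuming an output-stationary mapping in which each pass fills the array, streams the reduction dimension, and drains results. The formula and workload sizes are assumptions, not SCALE-Sim's cycle-accurate model.

```python
import math

def estimate_cycles(M: int, N: int, K: int, rows: int, cols: int) -> int:
    """First-order cycle estimate for an (M x K) @ (K x N) GEMM on a rows x cols
    output-stationary systolic array (assumed model, not SCALE-Sim's simulator)."""
    folds = math.ceil(M / rows) * math.ceil(N / cols)  # one output tile per pass
    per_fold = K + rows + cols - 2                     # stream K inputs + fill/drain skew
    return folds * per_fold

# Sweep array aspect ratio at a fixed 1024-MAC budget for an assumed GEMM shape.
for rows, cols in [(32, 32), (16, 64), (64, 16)]:
    cycles = estimate_cycles(M=256, N=256, K=512, rows=rows, cols=cols)
    print(f"{rows}x{cols}: {cycles} cycles")
```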
SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers
The vast majority of processors in the world are actually microcontroller
units (MCUs), which find widespread use performing simple control tasks in
applications ranging from automobiles to medical devices and office equipment.
The Internet of Things (IoT) promises to inject machine learning into many of
these everyday objects via tiny, cheap MCUs. However, these
resource-impoverished hardware platforms severely limit the complexity of
machine learning models that can be deployed. For example, although
convolutional neural networks (CNNs) achieve state-of-the-art results on many
visual recognition tasks, CNN inference on MCUs is challenging due to severe
memory limitations. To circumvent the memory challenge associated with
CNNs, various alternatives have been proposed that do fit within the memory
budget of an MCU, albeit at the cost of prediction accuracy. This paper
challenges the idea that CNNs are not suitable for deployment on MCUs. We
demonstrate that it is possible to automatically design CNNs which generalize
well, while also being small enough to fit onto memory-limited MCUs. Our Sparse
Architecture Search method combines neural architecture search with pruning in
a single, unified approach, which learns superior models on four popular IoT
datasets. The CNNs we find are more accurate and up to smaller
than previous approaches, while meeting the strict MCU working memory
constraint
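The toy sketch below illustrates only the MCU feasibility check at the heart of such a search: candidate CNN configurations are kept when their weights fit in flash and their peak activation working set fits in SRAM. The budgets, the two-layer network shape, and the footprint formulas are all illustrative assumptions, not the SpArSe search space or objective.

```python
# Toy sketch of the MCU feasibility check only (assumed budgets, network shape,
# and footprint formulas; not the SpArSe search space, pruning, or objective).
FLASH_BUDGET = 32 * 1024   # bytes of flash available for weights (assumed MCU)
SRAM_BUDGET = 16 * 1024    # bytes of SRAM available for activations (assumed MCU)

def footprint(c1: int, c2: int, hw: int = 32, in_ch: int = 1, classes: int = 10):
    """Bytes of int8 weights and peak int8 activation working set for an assumed
    toy CNN: 3x3 conv (in_ch->c1), pool, 3x3 conv (c1->c2), pool, dense classifier."""
    weights = in_ch * c1 * 9 + c1 * c2 * 9 + c2 * (hw // 4) ** 2 * classes
    act_conv1 = in_ch * hw * hw + c1 * hw * hw                 # conv1 in + out live
    act_conv2 = c1 * (hw // 2) ** 2 + c2 * (hw // 2) ** 2      # conv2 in + out live
    return weights, max(act_conv1, act_conv2)

feasible = []
for c1 in (4, 8, 16, 32):
    for c2 in (8, 16, 32, 64):
        w_bytes, a_bytes = footprint(c1, c2)
        if w_bytes <= FLASH_BUDGET and a_bytes <= SRAM_BUDGET:
            feasible.append((c1, c2))

print(f"{len(feasible)} of 16 candidate CNNs fit the assumed MCU budget")
```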
Efficient Residue Number System Based Winograd Convolution
Prior research has shown that the Winograd algorithm can reduce the computational
complexity of convolutional neural networks (CNN) with weights and activations
represented in floating point. However, it is difficult to apply the scheme to
the inference of low-precision quantized (e.g. INT8) networks. Our work extends
the Winograd algorithm to the Residue Number System (RNS). The minimal-complexity
convolution is computed precisely over large transformation tiles (e.g. 10 x 10
to 16 x 16) of filters and activation patches using the Winograd transformation
and low-cost (e.g. 8-bit) arithmetic, without degrading the prediction accuracy
of the networks during inference. The arithmetic complexity reduction is up to
7.03x, while the performance improvement is up to 2.30x and 4.69x for 3 x 3 and
5 x 5 filters, respectively. Comment: Accepted by the ECCV 2020 conference.
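For reference, the snippet below shows only the textbook 1-D Winograd F(2,3) identity that such schemes build on, reducing a 3-tap correlation over a 4-sample tile from 6 multiplies to 4; the paper's large-tile RNS formulation and 8-bit arithmetic are not reproduced here.

```python
import numpy as np

# Textbook 1-D Winograd F(2,3): two outputs of a 3-tap correlation from a
# 4-sample tile with 4 multiplies instead of 6 (Lavin-Gray transforms).
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G  = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=float)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

d = np.array([1.0, 2.0, 3.0, 4.0])       # input tile
g = np.array([0.5, 1.0, -1.0])           # 3-tap filter

m = (G @ g) * (BT @ d)                   # 4 elementwise multiplies in transform domain
y = AT @ m                               # inverse transform back to 2 outputs

direct = np.array([d[0:3] @ g, d[1:4] @ g])   # direct sliding-window correlation
assert np.allclose(y, direct)
print(y)                                 # [-0.5  0. ]
```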
Learning low-precision neural networks without Straight-Through Estimator(STE)
The Straight-Through Estimator (STE) is widely used for back-propagating
gradients through the quantization function, but the STE technique lacks a
complete theoretical understanding. We propose an alternative methodology
called alpha-blending (AB), which quantizes neural networks to low-precision
using stochastic gradient descent (SGD). Our method (AB) avoids STE
approximation by replacing the quantized weight in the loss function with an
affine combination (1 - alpha) * w + alpha * w_q of the quantized weight w_q and
the corresponding full-precision weight w, with non-trainable scalar coefficients
(1 - alpha) and alpha. During training, alpha is gradually increased from 0 to 1;
the gradient updates to the weights flow through the full-precision term
(1 - alpha) * w of the affine combination, so the model is converted from
full-precision to low-precision progressively. To evaluate the method, a 1-bit
BinaryNet on the CIFAR10 dataset and 8-bit and 4-bit MobileNet v1 and ResNet_50
v1/2 on the ImageNet dataset are trained using the alpha-blending approach, and the
evaluation indicates that AB improves top-1 accuracy by 0.9%, 0.82% and 2.93%
respectively, compared to the results of STE-based quantization. Comment: conference version accepted by IJCAI-2019.
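A minimal PyTorch sketch of the affine-combination idea: the quantized branch is detached so that gradients reach the weights only through the full-precision (1 - alpha) * w term, and alpha is ramped from 0 to 1 over training. The quantizer, the alpha schedule, and the placeholder loss are assumptions for illustration, not the paper's training setup.

```python
import torch

def quantize_sym(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Simple symmetric uniform quantizer, an illustrative stand-in for w_q."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

def alpha_blend(w: torch.Tensor, alpha: float) -> torch.Tensor:
    """(1 - alpha) * w + alpha * w_q, with the quantized branch detached so the
    gradient reaches w only through the full-precision term."""
    return (1.0 - alpha) * w + alpha * quantize_sym(w).detach()

w = torch.randn(64, 64, requires_grad=True)
for step in range(500):
    alpha = min(1.0, step / 400.0)               # assumed schedule: ramp alpha 0 -> 1
    loss = (alpha_blend(w, alpha) ** 2).mean()   # placeholder loss for the sketch
    loss.backward()
    with torch.no_grad():
        w -= 1e-2 * w.grad                       # plain SGD step on full-precision weights
        w.grad.zero_()
```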
High Throughput Matrix-Matrix Multiplication between Asymmetric Bit-Width Operands
Matrix multiplications between asymmetric bit-width operands, especially
between 8- and 4-bit operands are likely to become a fundamental kernel of many
important workloads including neural networks and machine learning. While
existing SIMD matrix multiplication instructions for symmetric bit-width
operands can support operands of mixed precision by zero- or sign-extending the
narrow operand to match the size of the other operands, they cannot exploit the
benefit of narrow bit-width of one of the operands. We propose a new SIMD
matrix multiplication instruction that uses mixed precision on its inputs (8-
and 4-bit operands) and accumulates product values into narrower 16-bit output
accumulators, in turn allowing the SIMD operation at 128-bit vector width to
process a greater number of data elements per instruction to improve processing
throughput and memory bandwidth utilization without increasing the register
read- and write-port bandwidth in CPUs. The proposed asymmetric-operand-size
SIMD instruction offers 2x improvement in throughput of matrix multiplication
in comparison to throughput obtained using existing symmetric-operand-size
instructions while causing negligible (0.05%) overflow from 16-bit accumulators
for representative machine learning workloads. The asymmetric-operand-size
instruction not only can improve matrix multiplication throughput in CPUs, but
also can be effective to support multiply-and-accumulate (MAC) operation
between 8- and 4-bit operands in state-of-the-art DNN hardware accelerators
(e.g., the systolic array microarchitecture in the Google TPU) and offer similar
improvement in matrix multiply performance seamlessly without violating the
various implementation constraints. We demonstrate how a systolic array
architecture designed for symmetric-operand-size instructions could be modified
to support an asymmetric-operand-size instruction.
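The behavioural sketch below emulates one such 8-bit x 4-bit multiply-accumulate chain with a wrapping 16-bit accumulator, to make the overflow question concrete; the reduction length and the uniformly random operands are assumptions, so the overflow behaviour here is not comparable to the 0.05% figure above.

```python
import random

# Behavioural model of an 8-bit x 4-bit multiply-accumulate chain feeding a
# wrapping 16-bit accumulator (reduction length and operand distribution assumed).
random.seed(0)
K = 64
a = [random.randint(-128, 127) for _ in range(K)]   # 8-bit operand (e.g. activations)
b = [random.randint(-8, 7) for _ in range(K)]       # 4-bit operand (e.g. weights)

def wrap_int16(v: int) -> int:
    """Two's-complement wrap to 16 bits, mimicking a hardware accumulator."""
    return ((v + 0x8000) & 0xFFFF) - 0x8000

acc, overflowed = 0, False
for x, y in zip(a, b):
    wide = acc + x * y                           # each product needs at most 11 bits
    overflowed |= not (-32768 <= wide <= 32767)  # would a real int16 accumulator wrap?
    acc = wrap_int16(wide)

print("accumulator:", acc, "overflowed:", overflowed)
```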
Ternary MobileNets via Per-Layer Hybrid Filter Banks
The MobileNets family of computer vision neural networks has fueled tremendous
progress in the design and organization of resource-efficient architectures in
recent years. New applications with stringent real-time requirements on highly
constrained devices require further compression of already compute-efficient
networks like MobileNets. Model quantization is a widely used technique to
compress and accelerate neural network inference and prior works have quantized
MobileNets to 4-6 bits albeit with a modest to significant drop in accuracy.
While quantization to sub-byte values (i.e. precision less than or equal to 8
bits) has been valuable, even further quantization of MobileNets to binary or
ternary values is necessary to realize significant energy savings and possibly
runtime speedups on specialized hardware, such as ASICs and FPGAs. Under the
key observation that convolutional filters at each layer of a deep neural
network may respond differently to ternary quantization, we propose a novel
quantization method that generates per-layer hybrid filter banks consisting of
full-precision and ternary weight filters for MobileNets. The layer-wise hybrid
filter banks essentially combine the strengths of full-precision and ternary
weight filters to derive a compact, energy-efficient architecture for
MobileNets. Using this proposed quantization method, we quantized a substantial
portion of the weight filters of MobileNets to ternary values, resulting in 27.98%
savings in energy and a 51.07% reduction in model size, while achieving
comparable accuracy and no degradation in throughput on specialized hardware in
comparison to the baseline full-precision MobileNets.
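An illustrative sketch of a per-layer hybrid filter bank: a chosen fraction of a convolution layer's output filters is ternarized with a simple threshold rule while the rest stay in full precision. The selection rule, the 70% ternary fraction, and the thresholding constant are assumptions, not the paper's learned assignment.

```python
import torch

def ternarize(w: torch.Tensor, delta_frac: float = 0.7) -> torch.Tensor:
    """Threshold-based ternary quantization of one filter (assumed rule: values
    within +/- delta of zero snap to 0, the rest to a shared +/- magnitude)."""
    delta = delta_frac * w.abs().mean()
    mask = (w.abs() > delta).float()
    scale = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)
    return scale * torch.sign(w) * mask

def hybrid_filter_bank(weight: torch.Tensor, ternary_fraction: float = 0.7) -> torch.Tensor:
    """Ternarize a fraction of a conv layer's output filters, keep the rest in
    full precision (the selection rule and fraction are assumptions)."""
    out = weight.clone()
    n_ternary = int(ternary_fraction * weight.shape[0])
    for i in range(n_ternary):
        out[i] = ternarize(weight[i])
    return out

conv_w = torch.randn(32, 16, 3, 3)        # [out_channels, in_channels, kH, kW]
mixed_w = hybrid_filter_bank(conv_w)      # first 70% of filters ternary, rest FP32
```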
Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration
Convolutional neural network (CNN) inference on mobile devices demands
efficient hardware acceleration of low-precision (INT8) general matrix
multiplication (GEMM). Exploiting data sparsity is a common approach to further
accelerate GEMM for CNN inference, and in particular, structural sparsity has
the advantages of predictable load balancing and very low index overhead. In
this paper, we address a key architectural challenge with structural sparsity:
how to provide support for a range of sparsity levels while maintaining high
utilization of the hardware. We describe a time-unrolled formulation of
variable density-bound block (VDBB) sparsity that allows for a configurable
number of non-zero elements per block, at constant utilization. We then
describe a systolic array microarchitecture that implements this scheme, with
two data reuse optimizations. Firstly, we increase reuse in both operands and
partial products by increasing the number of MACs per PE. Secondly, we
introduce a novel approach of moving the IM2COL transform into the hardware,
which allows us to achieve a 3x data bandwidth expansion just before the
operands are consumed by the datapath, reducing the SRAM power consumption. The
optimizations for weight sparsity, activation sparsity and data reuse are all
interrelated, so the optimal combination is not obvious. Therefore, we perform a
design space evaluation to find the Pareto-optimal design characteristics. The
resulting design achieves 16.8 TOPS/W in 16 nm with a modest 50% model sparsity
and scales with model sparsity up to 55.7 TOPS/W at 87.5%. As well as
successfully demonstrating the variable DBB technique, this result significantly
outperforms previously reported sparse CNN accelerators.
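A small sketch of the fixed-density special case of density-bound block sparsity: within each block of 8 consecutive weights, only the 4 largest-magnitude entries are kept, giving the predictable per-block load balance mentioned above. Making the non-zero count configurable is what the paper's variable DBB (VDBB) formulation adds; this sketch keeps it fixed.

```python
import numpy as np

def prune_dbb(weights: np.ndarray, block: int = 8, nnz: int = 4) -> np.ndarray:
    """Density-bound block pruning sketch: within every group of `block`
    consecutive weights, keep only the `nnz` largest-magnitude entries."""
    w = weights.reshape(-1, block).copy()
    for row in w:
        drop = np.argsort(np.abs(row))[:block - nnz]   # indices of the smallest entries
        row[drop] = 0.0
    return w.reshape(weights.shape)

w = np.random.randn(4, 16).astype(np.float32)
w_sparse = prune_dbb(w, block=8, nnz=4)    # at most 4 non-zeros per block of 8
print((w_sparse == 0).mean())              # ~0.5 structural sparsity
```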